To reduce the fluctuation caused by random sampling, especially in the bootstrap term, the N-step reward (a discounted summation) is useful. By expanding the Bellman equation, the N-step target of the Q function becomes \(\sum _{k=0}^{N-1} \gamma ^k r_{t+k} + \gamma ^N \max _{a} Q(s_{t+N},a)\).
According to W. Fedus et al., the N-step reward can utilize a larger buffer more effectively. Even though the N-step reward is based on the behavior policy used during exploration, and is therefore not theoretically justified for off-policy learning, it still works better in practice.
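For concreteness, with \(N = 3\) the target above expands to \(r_t + \gamma r_{t+1} + \gamma ^2 r_{t+2} + \gamma ^3 \max _{a} Q(s_{t+3},a)\): three observed rewards followed by a bootstrapped estimate taken three steps ahead, so the bootstrap term is discounted more heavily than in the 1-step target.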
You can create an N-step version of the replay buffer by specifying the `Nstep` parameter in the constructor of `ReplayBuffer` or `PrioritizedReplayBuffer`. Without any modification of your environment, cpprb sums up the N-step rewards and slides the "next" values such as `next_obs` accordingly.
The `Nstep` parameter is a `dict` with the keys `"size"`, `"rew"`, `"gamma"`, and `"next"`. `Nstep["size"]` is the N-step size; a size of 1 is identical to an ordinary replay buffer (but less efficient). `Nstep["rew"]`, whose type is `str` or array-like of `str`, specifies the (set of) reward(s). `Nstep["gamma"]` is the discount factor for the reward summation. `Nstep["next"]`, whose type is `str` or array-like of `str`, specifies the (set of) "next"-type value(s); for these, the `sample` method returns the (i+N)-th value instead of the (i+1)-th one. `sample` also replaces `"done"` with its N-step version.
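For example, a 4-step configuration matching the buffer fields used in the full example below could look like the following minimal sketch (the field names are placeholders that must match your buffer layout):

```python
# Minimal sketch of an Nstep configuration dict (field names must match
# the environment dict passed to the ReplayBuffer constructor).
Nstep = {
    "size": 4,           # N: summarize 4-step rewards
    "gamma": 0.99,       # discount factor used for the reward summation
    "rew": "rew",        # str (or array-like of str) naming the reward field(s)
    "next": "next_obs",  # str (or array-like of str) naming the "next"-type field(s)
}
```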
cpprb v10 no longer returns \(\gamma ^{N-1}\), since users can always multiply by the fixed \(\gamma ^N\) themselves.
Since the N-step buffer temporarily stores values in a local storage, you need to call the `on_episode_end` member function at the end of every episode to flush them into the main storage properly, as shown in the example below.
| Parameter | Type | Description |
|---|---|---|
| `size` | `int` | N-step size |
| `rew` | `str` or array-like of `str` | N-step reward |
| `gamma` | `float` | Discount factor |
| `next` | `str` or array-like of `str` | Next items (e.g. `next_obs`) |
| Return Value | Replacement: From \(\to\) To |
|---|---|
| Next items (e.g. `next_obs`) | \(s_{t+1} \to s_{t+N}\) |
| `rew` | \(r_t \to \sum _{n=0}^{N-1} \gamma ^n r_{t+n}\) |
| `done` | \(d_t \to 1-\prod _{n=0}^{N-1} (1-d_{t+n})\) |
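As a quick sanity check of the table above, the following sketch evaluates the `rew` and `done` replacement formulas by hand for a 3-step window with arbitrary numbers (it does not call cpprb; it only reproduces the formulas):

```python
import numpy as np

gamma = 0.99
rews  = np.array([1.0, 0.5, 2.0])  # r_t, r_{t+1}, r_{t+2}
dones = np.array([0.0, 0.0, 1.0])  # d_t, d_{t+1}, d_{t+2}

# rew: r_t -> sum_{n=0}^{N-1} gamma^n r_{t+n}
nstep_rew = np.sum(gamma ** np.arange(len(rews)) * rews)  # 1.0 + 0.99*0.5 + 0.99**2*2.0 ~= 3.455

# done: d_t -> 1 - prod_{n=0}^{N-1} (1 - d_{t+n})
nstep_done = 1.0 - np.prod(1.0 - dones)                   # 1.0: the episode ends within the window
```

The full example below builds a 4-step buffer end to end.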
import numpy as np
from cpprb import ReplayBuffer

nstep = 4
gamma = 0.99
discounts = gamma ** nstep  # fixed gamma^N used for the bootstrap term

# 4-step buffer: "rew" is summed and "next_obs" is slid N steps ahead.
rb = ReplayBuffer(32,
                  {'obs': {"shape": (4,4)},
                   'act': {"shape": 3},
                   'rew': {},
                   'next_obs': {"shape": (4,4)},
                   'done': {}},
                  Nstep={"size": nstep,
                         "gamma": gamma,
                         "rew": "rew",
                         "next": "next_obs"})

for i in range(100):
    done = 1.0 if i%10 == 9 else 0.0
    rb.add(obs=np.zeros((4,4)),
           act=np.ones((3)),
           rew=1.0,
           next_obs=np.zeros((4,4)),
           done=done)
    if done:
        # Flush the N-step local storage into the main storage at episode end.
        rb.on_episode_end()

sample = rb.sample(16)

# Q stands for the user's Q-network, returning action values of shape (batch, n_actions).
nstep_target = sample["rew"] + (1-sample["done"]) * discounts * Q(sample["next_obs"]).max(axis=1)
This N-step feature assumes that the transitions of a trajectory (episode) are stored sequentially. If you use a distributed-agent configuration, you must add each episode as a single contiguous block, without interleaving transitions from different agents.
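One way to satisfy this constraint is to let each actor collect a whole episode locally and then add it in one call. The sketch below assumes cpprb's batched `add` (passing arrays with a leading time dimension); if your version does not support that, add the transitions one by one inside the same critical section instead, still without interleaving actors.

```python
import numpy as np

def add_episode(rb, episode):
    """Add one locally collected episode to the shared buffer as a single block.

    `episode` is assumed to be a dict of per-step lists gathered by one actor, e.g.
    {"obs": [...], "act": [...], "rew": [...], "next_obs": [...], "done": [...]}.
    """
    # Stack each field along a leading time dimension and add the whole episode at once.
    rb.add(**{key: np.stack(values) for key, values in episode.items()})
    rb.on_episode_end()  # flush the N-step local storage for this episode
```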